Final Project: NYC Taxi Data

Introduction


Data

This project relies on data from New York City’s Taxi and Limousine Commission (TLC). NYC publishes TLC data for all trips taken by Yellow Taxis, Green Taxis, For-Hire Vehicles, and High Volume For-Hire Vehicles. We rely on the Yellow Taxi data, as this is the transportation method most people use and are familiar with. NYC makes full trip data available starting in 2009, organized by month. Each month contains data on roughly 7 million trips. Given the size of this data, we chose to work only with data from 2019. A significant amount of data is available for each trip: pickup and dropoff times, pickup and dropoff locations, rate code, payment type, fare amount, credit card tips, and total amount.

Project Team

For this project, Jae and Andrew received permission from Professor Brambor to work in a group of two. The two of us had been planning a project using traffic data from New York City since before this semester began. We also plan to expand the scope of this project during the summer by implementing machine learning algorithms to build predictive models for this dataset. Given these factors, and our shared interest in this topic, we thought working in a group of two would be most effective.

Website Content

Our website is split into two major portions. In the first part, we used various ggplots to highlight patterns in trip duration. These take the form of time series, built by aggregating the data and tailoring the configuration of each plot. In the second part, we used the variables best suited to being shown on maps. These are leaflet maps built from aggregated data, with custom features added.

Please keep in mind that all the graphs are interactive. Instead of focusing on a specific variable, our group thought it would be more useful to give users control, so that each user can draw on the visualizations according to his or her individual needs, whether from the perspective of a passenger or a taxi driver. All the preprocessing and data scrubbing, as well as the exploratory data analysis, are included in our process book and excluded from this section. This website only includes our final data visualization outputs, in accordance with the instructions laid out on the class website.

When aggregating data, our team used the median for all variables apart from tip amounts. We did this because the data contained enough outliers to skew the mean during aggregation. As such, we determined that the median most accurately captures the typical value of each variable of interest.
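To illustrate why outliers matter here, the following is a minimal base R sketch that aggregates the same column by mean and by median. The data frame and its column names (`pickup_hour`, `duration_min`) are hypothetical and are not taken from our dataset.

```r
# Hypothetical toy data: five ordinary trips plus one extreme outlier (minutes).
trips <- data.frame(
  pickup_hour  = c(8, 8, 8, 8, 8, 8),
  duration_min = c(10, 11, 12, 13, 14, 600)
)

# Aggregate the same column two ways.
by_mean   <- aggregate(duration_min ~ pickup_hour, data = trips, FUN = mean)
by_median <- aggregate(duration_min ~ pickup_hour, data = trips, FUN = median)

print(by_mean$duration_min)    # 110: pulled upward by the single 600-minute trip
print(by_median$duration_min)  # 12.5: close to a typical trip
```

A single extreme trip moves the mean far from the typical value, while the median barely changes.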


Time Series Analysis using ggplots

Using ggplots, our group first decided to focus on trip duration. We think there is an interesting story to be told from this variable. From a consumer perspective, it is useful to know at what hour of the day, on what day of the week, or in what month of the year trips are shortest. Likewise, it is useful for a cab driver to know when trips are longest or shortest. Even though the same approach could have been applied to trip distance and fare amount, we thought these would be best shown on maps instead of graphs, as in our next section.

From the graph above, the peak period runs from 9 a.m. to 6 p.m., with the median duration hovering around 12 minutes, and the duration gradually declines after 6 p.m. The lowest points are around 5 a.m. or 6 a.m. This is reasonable considering that most people work from 9 a.m. to 6 p.m., and people frequently use cabs throughout the working day in the bustling city of New York.
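As a rough sketch of the aggregation behind an hourly graph like this one, the base R example below derives trip duration from pickup and dropoff timestamps and takes the median per pickup hour. The sample rows are made up, and the column names `tpep_pickup_datetime` and `tpep_dropoff_datetime` are assumed to follow the TLC Yellow Taxi schema. Our actual preprocessing is in the process book and may differ.

```r
# Hypothetical sample rows; timestamp column names are assumed, not confirmed.
trips <- data.frame(
  tpep_pickup_datetime = as.POSIXct(
    c("2019-03-04 05:10:00", "2019-03-04 05:40:00", "2019-03-04 09:00:00",
      "2019-03-04 09:15:00", "2019-03-04 09:30:00"), tz = "UTC"),
  tpep_dropoff_datetime = as.POSIXct(
    c("2019-03-04 05:16:00", "2019-03-04 05:48:00", "2019-03-04 09:11:00",
      "2019-03-04 09:28:00", "2019-03-04 09:45:00"), tz = "UTC")
)

# Trip duration in minutes.
trips$duration_min <- as.numeric(difftime(trips$tpep_dropoff_datetime,
                                          trips$tpep_pickup_datetime,
                                          units = "mins"))

# Hour of day (0-23) in which the trip started.
trips$pickup_hour <- as.integer(format(trips$tpep_pickup_datetime, "%H"))

# Median duration for each pickup hour.
hourly <- aggregate(duration_min ~ pickup_hour, data = trips, FUN = median)
print(hourly)
```

The resulting `hourly` table has one row per hour and could be passed to a plotting function to draw the time series.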

Our group further dissected the duration over a day by comparing its trends by day of the week. Weekdays follow a similar pattern: a sharp increase in trip duration from 7 a.m. to 12 p.m., followed by a gradual decline. This is similar to the aggregated graph above, but the effect is more pronounced and the rush hour seems to start earlier. The weekends differ clearly: Saturday and Sunday both show a gradual increase from 7 a.m. until 7 p.m. (Saturday) or 3 p.m. (Sunday). This suggests that people start their day more slowly on these days.

Note: you can isolate a single day by double-clicking on the desired day of the week in the legend box of the interactive graph. This applies to any interactive graph with the legend embedded within it.

Next, we shifted gears to study how the duration varies by day of the week. The shape almost resembles a normal distribution curve, with Monday and Sunday having the two lowest durations (10.5 and 10 minutes), while Thursday marks the peak at around 12 minutes. This matches the previous graph, where Thursday had the highest duration and Sunday the lowest.
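One practical detail when grouping by day of the week is keeping the days in calendar order rather than alphabetical order. The sketch below shows one way to do that in base R with an ordered set of factor levels. The data and column names are hypothetical, and `"%u"` (ISO weekday number, Monday = 1) is used so the result does not depend on the system locale.

```r
# Hypothetical sample: 2019-03-04 is a Monday, 03-07 a Thursday, 03-10 a Sunday.
trips <- data.frame(
  pickup_date  = as.Date(c("2019-03-04", "2019-03-04", "2019-03-07",
                           "2019-03-07", "2019-03-10")),
  duration_min = c(10, 11, 12, 13, 9)
)

# Map ISO weekday numbers (1 = Monday ... 7 = Sunday) to labeled factor levels
# so that grouped output stays in calendar order.
day_labels <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
trips$weekday <- factor(as.integer(format(trips$pickup_date, "%u")),
                        levels = 1:7, labels = day_labels)

# Median duration per weekday; only days present in the data appear.
by_day <- aggregate(duration_min ~ weekday, data = trips, FUN = median)
print(by_day)
```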

Our group thought it would also be interesting to analyze the duration by day of the week for each month, and compare month to month. June and October generally share the highest peak across the days of the week. This may be attributed to major holidays (e.g. summer break) or the start of school (e.g. the fall semester for universities), when people tend to travel more than in other months of the year. From day to day, Thursday seems to have the longest trips of any day of the week, and Sunday the shortest. This pattern is consistent with the graph above.



Trip Analysis using Leaflet Maps

The following maps present different aspects of Yellow Taxi activity within New York City. These maps are divided into Taxi Zones defined by the city. Map visualizations have the advantage of clearly highlighting features by geography. The first set of maps examines the tipping behavior of passengers, organized by pickup zone. The next set looks more broadly at NYC traffic patterns, analyzing which zones were more congested. The final map examines taxi activity as a whole.

Map 1 - Average Tip by Pickup Zone

This map displays the average credit card tip amount for each NYC Yellow Taxi pickup zone. This data was filtered to only include trips paid by credit card, as the NYC TLC dataset does not record cash tips. Interestingly, Manhattan pickup zones have relatively low average tips, while the other boroughs generally appear to have higher average tips.

Map 2 - Average Tip Percent

Though mapping average tip data effectively shows which zones tip higher, a more accurate method is to analyze tips by tip percent, since most riders tip a percentage of the total trip cost. The map below displays the average tip percent by taxi zone. This map tells a similar story to the previous visualization. Fun Fact: the average tip percentage in our dataset for 2019 is 18%.
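A minimal sketch of how a per-zone tip percent could be computed is shown below. It makes several assumptions that are not confirmed by our write-up: the column names follow the TLC Yellow Taxi schema, `payment_type == 1` marks credit card trips, and the tip is expressed as a percentage of the pre-tip total (`total_amount - tip_amount`). The sample rows are made up.

```r
# Hypothetical sample rows; column names and codes are assumptions.
trips <- data.frame(
  PULocationID = c(1, 1, 2, 2, 2),
  payment_type = c(1, 1, 1, 2, 1),   # assumed: 1 = credit card, 2 = cash
  tip_amount   = c(2, 3, 1, 0, 4),
  total_amount = c(12, 18, 11, 9, 24)
)

# Keep credit card trips only, since cash tips are not recorded.
card <- trips[trips$payment_type == 1, ]

# Assumed definition: tip as a percentage of the trip cost before the tip.
card$pre_tip_total <- card$total_amount - card$tip_amount
card <- card[card$pre_tip_total > 0, ]   # avoid dividing by zero
card$tip_pct <- 100 * card$tip_amount / card$pre_tip_total

# Average tip percent for each pickup zone.
tip_by_zone <- aggregate(tip_pct ~ PULocationID, data = card, FUN = mean)
print(tip_by_zone)
```

Choosing a different base for the percentage (for example the fare alone) would shift the numbers, so the definition should be stated alongside any reported figure.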

Map 3 - Median Speed

This map displays the median speed for each taxi zone. Since exact trip route data is not available, trips were filtered to include only those with a pickup and a dropoff within the same taxi zone, in order to more accurately identify which zones were more congested.
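The same-zone filter and speed calculation can be sketched in base R as follows. The sample data is hypothetical, the `trip_distance` column is assumed to be in miles, and `duration_min` is assumed to have been derived beforehand from the pickup and dropoff times.

```r
# Hypothetical sample rows; the last trip crosses zones and is filtered out.
trips <- data.frame(
  PULocationID  = c(10, 10, 10, 20, 20, 10),
  DOLocationID  = c(10, 10, 10, 20, 20, 20),
  trip_distance = c(1.0, 2.0, 1.5, 0.5, 1.0, 5.0),  # assumed to be miles
  duration_min  = c(10, 10, 10, 10, 10, 20)
)

# Keep trips that start and end in the same zone and have a positive duration.
same_zone <- trips[trips$PULocationID == trips$DOLocationID &
                     trips$duration_min > 0, ]

# Speed in miles per hour.
same_zone$speed_mph <- same_zone$trip_distance / (same_zone$duration_min / 60)

# Median speed per zone (roughly 9 mph for zone 10 and 4.5 mph for zone 20).
speed_by_zone <- aggregate(speed_mph ~ PULocationID, data = same_zone,
                           FUN = median)
print(speed_by_zone)
```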

Map 4 - Average Trip Cost

This map uses the same data as the previous map, which selects only trips that have the same pickup and dropoff zones. This map displays the average total cost of a trip (minus tips) for each zone. Given that taxi meters rely on a combination of distance and speed, this map serves as another proxy measure of congestion. The more expensive a trip, the farther the taxi traveled or the longer the trip took.

Map 5 - Pickup and Dropoff Volume

This final map displays taxi activity. Use the layer filters to display either pickup or dropoff volumes. Click on a taxi zone to display the number of pickups and dropoffs in our dataset. Please note: this data relies on a subset of the entire dataset, so the counts are only a portion of total trips, while the relative volumes are accurate. Unsurprisingly, Manhattan has many active pickup and dropoff zones. Similarly, JFK Airport and LaGuardia Airport are some of the busiest areas.
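Counting pickups and dropoffs per zone can be sketched as below. The zone IDs and rows are hypothetical. Using a shared set of factor levels ensures that a zone with dropoffs but no pickups (or the reverse) still gets a row with a count of zero.

```r
# Hypothetical sample of trips with pickup and dropoff zone IDs.
trips <- data.frame(
  PULocationID = c(132, 132, 161, 161, 161, 138),
  DOLocationID = c(161, 230, 132, 161, 138, 161)
)

# Every zone that appears as either a pickup or a dropoff.
zones <- sort(unique(c(trips$PULocationID, trips$DOLocationID)))

# Tabulate against the shared zone list so missing zones count as 0.
volume <- data.frame(
  LocationID = zones,
  pickups    = as.integer(table(factor(trips$PULocationID, levels = zones))),
  dropoffs   = as.integer(table(factor(trips$DOLocationID, levels = zones)))
)
print(volume)
```

A table shaped like `volume` has one row per zone, which makes it straightforward to join to zone polygons before mapping.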

library(shiny)
library(leaflet)
library(dplyr)

# Assumes all_months (trip records), taxi_zones (zone attributes), and
# proj (zone polygons) were created earlier in the analysis.
ui <- fluidPage(
  titlePanel("Traffic Volume Analysis"),
  mainPanel(
    leafletOutput("map"),
    br(), br()),
  sidebarPanel(
    # User chooses the pickup zone to map
    selectInput("location", "Starting Area Location",
      unique(all_months$PULocationID))
  ))

server <- function(input, output, session) {
  output$map <- renderLeaflet({
    # Count dropoffs per zone for trips that start in the selected zone
    filtered_df <-
      all_months %>%
      filter(PULocationID == input$location) %>%
      group_by(DOLocationID) %>%
      tally()

    # One palette shared by the polygons and the legend
    pal <- colorQuantile("Blues", filtered_df$n, n = 5)

    # Create popup contents.
    # Note: this assumes the rows of filtered_df line up with the zones
    # in taxi_zones and proj.
    content_do <- paste("Neighborhood:", taxi_zones$zone, "<br/>",
                        "Number of Dropoffs:", filtered_df$n, "<br/>")

    # Create map
    leaflet(filtered_df) %>%
      addTiles() %>%
      setView(lng = -73.98928, lat = 40.75042, zoom = 10.2) %>%
      addProviderTiles("CartoDB.Positron") %>%
      addPolygons(
        data = proj,
        popup = content_do,
        weight = 1,
        fillColor = pal(filtered_df$n),
        fillOpacity = 1,
        highlightOptions = highlightOptions(
          color = "#000000",
          weight = 3,
          bringToFront = TRUE,
          sendToBack = TRUE),
        label = taxi_zones$zone) %>%
      addLegend("topright",
                pal = pal,
                values = filtered_df$n,
                title = "Drop Off Volume",
                opacity = 1)
  })
}

shinyApp(ui, server)